Workload-Balanced Processing of Top-K Join Queries on Cluster Architectures
Authors
Abstract
The observation that a significant class of data processing and analysis applications can be expressed in terms of a small set of easily parallelized primitives has led to the growing popularity of batch-oriented, highly parallelizable cluster frameworks. These frameworks, however, are known to have shortcomings in certain application domains. For example, in many data analysis applications, the utility of a given data element to a particular analysis task depends on the way the data is collected (e.g., its precision) or interpreted. Because existing batch data processing frameworks do not consider variations in data utility, they cannot focus on the best results. Even if the user is interested in only a relatively small subset of the best result instances, these systems often need to enumerate entire result sets, including low-utility results. In this paper, we introduce and describe uSplit, a data partitioning strategy for processing top-k join queries in batch-oriented cluster environments. In particular, we describe how uSplit adaptively samples data from "upstream" operators to help allocate resources for top-k join processing in a way that balances work and avoids wasted work. Experimental results show that the proposed sampling, data partitioning, and join processing strategies enable uSplit to return top-k results with high confidence and low overhead (up to ∼9× faster than alternative schemes on 10 servers).
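To make the problem setting concrete, here is a minimal Python sketch of the naive baseline the abstract criticizes: a hash join that enumerates every matching pair and only then keeps the k best. It is not uSplit and does not reflect the paper's implementation. The function name `top_k_join`, the tuple layout `(join_key, payload, utility)`, and the choice of combining utilities by multiplication are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def top_k_join(left, right, k):
    """Return the k highest-utility join results (naive baseline sketch).

    left, right: iterables of (join_key, payload, utility) tuples.
    Assumption: the utility of a joined pair is the product of the
    utilities of its two inputs.
    """
    # Build a hash index on the right input, keyed by join key.
    index = defaultdict(list)
    for key, payload, utility in right:
        index[key].append((payload, utility))

    def all_results():
        # Enumerates every matching pair, including low-utility ones.
        # This full enumeration is the wasted work described above.
        for key, l_payload, l_utility in left:
            for r_payload, r_utility in index.get(key, ()):
                yield (l_utility * r_utility, key, l_payload, r_payload)

    # Keep only the k results with the highest combined utility.
    return heapq.nlargest(k, all_results())


if __name__ == "__main__":
    left = [("a", "l1", 0.9), ("a", "l2", 0.2), ("b", "l3", 0.8)]
    right = [("a", "r1", 0.5), ("b", "r2", 1.0), ("b", "r3", 0.1)]
    for score, key, l_payload, r_payload in top_k_join(left, right, 2):
        print(key, l_payload, r_payload, score)
```

In this sketch all four matching pairs are computed even though only two are returned. Avoiding that kind of enumeration across partitions is the goal the abstract attributes to uSplit.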
Similar resources
Parallel Processing of "GroupBy-Before-Join" Queries in Cluster Architecture
SQL queries in the real world are replete with group-by and join operations. This type of query is often known as a "GroupBy-Join" query. In some GroupBy-Join queries, it is desirable to perform group-by before join in order to achieve better performance. This subset of GroupBy-Join queries is called "GroupBy-Before-Join" queries. In this paper, we present a study on parallelization queries...
AdaptDB: Adaptive Partitioning for Distributed Joins
Big data analytics often involves complex join queries over two or more tables. Such join processing is expensive in a distributed setting both because large amounts of data must be read from disk, and because of data shuffling across the network. Many techniques based on data partitioning have been proposed to reduce the amount of data that must be accessed, often focusing on finding the best ...
The RankGroup Join Algorithm: Top-k Query Processing in XML Datasets
This project investigates top-k queries in XML datasets. We propose a syntactical addition to XQuery to accommodate top-k XML queries. We then propose a 3-step process to realize these top-k XML queries using a relational database and a new join operator, RankGroup. Our preliminary implementation shows promise in dramatically reducing the running time and number of tuples accessed during such q...
Optimizing Multiple Top-K Queries over Joins
Advanced data mining applications require more and more support from relational database engines. In particular, clustering applications in high-dimensional feature spaces demand proper support for multiple top-k queries in order to perform projected clustering. Although some research tackles the problem of optimizing restricted ranking (top-k) queries, there is no solution considering more than on...
Efficient Execution of Top-K SPARQL Queries
Top-k queries, i.e. queries returning the top k results ordered by a user-defined scoring function, are an important category of queries. Order is an important property of data that can be exploited to speed up query processing. State-of-the-art SPARQL engines underuse order, and top-k queries are mostly managed with a materialize-then-sort processing scheme that computes all the matching solut...